Methods for Data Science Coursework 1

0. Introduction

Let us first store the functions that will be used throughout the coursework:

1) MSE to compute the mean squared error (MSE) by inputting the actual response variable (y) and the predicted response variable (y_pred)

2) cross_val_split to split our training data into a number of folds for cross validation by inputting our data (data) and the number of folds we want (num_folds)

3) r2_score to compute the coefficient of determination of the model by inputting the actual response variable from test set (y_test) and the predicted response variable (y_pred)

4) standardize to standardize a matrix X
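These four helpers could be sketched in NumPy as follows (a minimal sketch; the exact signatures used in the notebook may differ):

```python
import numpy as np

def MSE(y, y_pred):
    """Mean squared error between actual and predicted responses."""
    return np.mean((y - y_pred) ** 2)

def cross_val_split(data, num_folds):
    """Split the rows of `data` into `num_folds` roughly equal random folds."""
    indices = np.random.permutation(len(data))
    return [data[idx] for idx in np.array_split(indices, num_folds)]

def r2_score(y_test, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_test - y_pred) ** 2)
    ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
    return 1 - ss_res / ss_tot

def standardize(X):
    """Centre each column of X and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```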

TASK 1: REGRESSION

1.0 Preliminaries

As an introduction to this regression task, we first standardize the data that we will use throughout the exercise, and then split both the train and test sets into the descriptors (design matrix) and the response variable y.

Standardization of descriptors

This is a key step in processing our data. The descriptors don't necessarily share the same scale, which can lead to large discrepancies in their contributions to the model and hence to its accuracy. For example, if one of the independent variables is on a much larger scale than the others, the model will be dominated by that variable, and the influence of the other descriptors will be overlooked.

Splitting

Now we are splitting our train set and test sets as follows:

Train set: (data on which we operate all kinds of optimisation methods)

Validation set: (note that the validation set shall only be used for accuracy comparison; no optimisation is ever performed on it)

This is necessary for the 3 methods that we will be implementing: linear regression, ridge regression and kNN.

Interpretation of response variable against 1 predictor

This plot indicates that fitting a linear model might be a good first approach to estimate predictions. However note that this is the plot of only one predictor against the response variable.

What one could do is look at the correlation matrix of the predictors, as follows. It is useful for finding relationships between predictors and for looking at the distribution of each predictor.

Interpretation of pairplot

The diagonal entries of this matrix show the distribution of a single predictor via its histogram. The scatter plots in the off-diagonal entries show whether two variables are related. For example, predictor 16 and predictor 7 are linearly correlated, whilst predictor 14 and predictor 8 seem to have no relationship whatsoever. (3)

1.1 LINEAR REGRESSION

1.1.1

In this first task, we fit a linear regression to predict the median value of owner-occupied homes in USD 1000s as a response variable Y, with the other features as predictors forming the design matrix X. We also display the mean squared error (MSE) on the training set.

In this exercise we are solving the following equation for beta: $$ \underset{\boldsymbol\beta}{\text{min}} \| \mathcal Y - \mathcal X \boldsymbol\beta \|^2 = \underset{\boldsymbol\beta}{\text{min}} \ \text{L}_{\text{LS}} (\boldsymbol\beta) $$ where $\text{L}_{\text{LS}}$ is the (ordinary) least squares loss function.

First, let us define the necessary functions for this task:

1) max_lik_estimate, which computes the maximum likelihood estimator (MLE), the solution to the above equation $$ \beta_{\text{ML}} = (X^TX)^{-1}X^Ty $$ where X is the design matrix and y the response variable, per the lecture notes.

2) predict_with_estimate, which computes the predicted y's for our estimated beta: $$ y_{\text{pred}} = X.\beta_{\text{ML}} $$
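A minimal NumPy sketch of these two functions, assuming the design matrix X already includes a column of ones for the intercept (solving the normal equations directly is numerically preferable to forming the inverse):

```python
import numpy as np

def max_lik_estimate(X, y):
    # Solve the normal equations (X^T X) beta = X^T y for the MLE
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict_with_estimate(X, beta):
    # Predicted responses: y_pred = X beta
    return X @ beta
```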

Now, let us compute the said beta for our training dataset, and report the parameters of our model, namely the intercept and the linear regression slope.

Now, let us compute the prediction of our model on the training dataset, y_pred_train, and its in-sample MSE:

Now let us look at the residual plot for the training data, which lets us analyse the variance of the regressor's errors. Our points are roughly randomly spread around the horizontal axis, which suggests that a linear regression model is appropriate and performs reasonably well.

That said, the behaviour around y=0 is not exactly uniform: a significant number of data points are off-centre for high values of y. We might therefore wonder whether a different method would estimate y better.

1.1.2

Now, we take the model we computed in 1.1.1, use it on our test data to predict the target variable y_pred_test, and compute its out-of-sample MSE on the test set.

As in 1.1.1, this plot of residuals shows a good fit of our model on the validation data.

This plot of residuals shows some kind of behaviour around y=0, which is not exactly uniform. We might wonder whether we could use a different method to estimate y.

Comparison of the out-of-sample and the in-sample MSE

We get an MSE of 24.36 in-sample and 19.84 out-of-sample. This means our model fits the test set better than the training set, which seems a bit odd, so we might wonder why. Perhaps the data wasn't split representatively, or isn't clean, or standard least squares isn't the best method to apply here. Indeed, the residuals show significant outliers for large values of y, which might indicate that a non-linear method would be more appropriate. Regardless, the difference between these two MSEs is quite small, which is a good indicator that our model isn't overfitting.

1.2 RIDGE REGRESSION

1.2.1

In this task, we are using a 5-fold cross validation algorithm to tune our ridge model with the penalty parameter.

Essentially, we want to solve this equation for beta: $$ \underset{\boldsymbol\beta}{\text{min}} \| \mathcal Y - \mathcal X \boldsymbol\beta \|^2 + \lambda \| \boldsymbol\beta \|^2 = \underset{\boldsymbol\beta}{\text{min}} \ \text{L}_{\text{ridge}} (\boldsymbol\beta) $$ where $\text{L}_{\text{ridge}}$ is the ridge loss function.

This time, we have a parameter: lambda or the penalty, that we will be tuning using 5-fold cross validation.

Let us first of all define the necessary function to implement this task:

1) ridge_estimate is a function which computes the ridge beta (solution of the above equation) as follows:

$$ \boldsymbol\beta_{\text{ridge}} = (\boldsymbol X^T\boldsymbol X + \lambda I)^{-1}\boldsymbol X^T\boldsymbol y \, . $$

where X is the design matrix, lambda the penalty parameter (always positive), I the identity matrix and y the response variable.

2) cross_val_evaluate_r is a function which evaluates the in-sample and out-of-sample MSE in our data for the 5-fold cross validation
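The ridge estimator above could be sketched as follows (a sketch using np.linalg.solve rather than an explicit inverse, for numerical stability; the cross-validation wrapper is omitted):

```python
import numpy as np

def ridge_estimate(X, y, lam):
    # beta_ridge = (X^T X + lambda I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Note that for lam = 0 this reduces to the ordinary least squares estimate, while larger values of lam shrink the coefficients towards zero.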

Interpretation of plot of training and validation MSE for fold #4 against values of lambda penalty

We can see on this plot of training and validation MSE for fold #4 that there clearly is a value of penalty lambda corresponding to the minimum of the validation MSE. In the next cell we will be computing this minimal penalty value for all the folds.

Hence, after tuning the penalty over the 5 folds via cross-validation, we find that the optimal penalty value is 3.

1.2.2

Here we want to compute the average in-sample and out-of-sample MSE over the 5 folds and compare to standard linear regression.

Interpretation of residual plots

These 2 residual plots for the training and test sets are very similar to the ones from standard linear regression. This is not so surprising, seeing as our predicted response variables are extremely similar too.

Comparison of the out-of-sample and the in-sample MSE

We get an MSE of 24.58 in-sample and 19.50 out-of-sample, very close to what we get with standard linear regression. Again, our model seems to fit the test set better than the training set, which seems a bit odd, so we might wonder why. Perhaps the data wasn't split representatively, or isn't clean, or other regression methods (such as non-linear ones) could be applied, seeing as the residuals are not exactly uniformly distributed around y=0. Regardless, the difference between the two MSEs is quite small, which is a good indicator that our model isn't overfitting. Additionally, the out-of-sample MSE for ridge regression is slightly lower than for standard linear regression, which makes sense: we would expect ridge regression to generalise slightly better than standard linear regression thanks to the penalty parameter we have tuned.

Differences between the two methods in terms of particular predictors of interest

Interpretation of plot of betas

The main goal of ridge regression is to shrink the betas, which makes them more stable. In the plot above, the (flat) blue line represents the consistency of the beta parameters across all predictors, whereas for standard linear regression we can see a large discrepancy in the beta values, especially for columns 2, 4, 14 and 16. This can explain the slight difference in MSE between the two models, ridge having a slightly lower out-of-sample MSE than standard linear regression.

1.3 REGRESSION WITH K NEAREST NEIGHBOURS (KNN)

In this task, we are implementing the kNN algorithm as a regression model. Note that this is a non-linear regression method.

We compare its performance to the other regression models we implemented. For each query point, the kNN algorithm takes as input the k closest training data points and outputs, as the predicted value, the average of the responses of those k nearest neighbours. We therefore want to tune k to obtain the lowest MSE.

Interpretation of data exploration

This plot of the data points of the response variable y against one of the features allows us to visualise what we are going to do for kNN, which is essentially computing distances between training set and test set data points. Once again, red points are test data and blue points are train data.

1.3.1

Let us first of all define the necessary function to implement this task:

1) euclidian_distance is a function which computes the Euclidean distance between data points as follows:

$$ d(\boldsymbol p, \boldsymbol q) = \sqrt{\sum_{i=1}^D{(q_i-p_i)^2}} \, , $$

2) k_neighbours is a function which finds the k nearest neighbours in our train set for every test data point

3) reg_predict is a function which computes the y_pred for our model

4) cross_val_evaluate_knn is a function which computes the MSE over the 5 cross validation folds for cross validation train and test data
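A sketch of the first three helpers (the cross-validation wrapper is omitted); the names mirror the list above, but the exact implementations are assumptions:

```python
import numpy as np

def euclidian_distance(p, q):
    # Euclidean distance between two feature vectors
    return np.sqrt(np.sum((np.asarray(q) - np.asarray(p)) ** 2))

def k_neighbours(X_train, x_test, k):
    # Indices of the k training points closest to x_test
    dists = np.array([euclidian_distance(x, x_test) for x in X_train])
    return np.argsort(dists)[:k]

def reg_predict(X_train, y_train, X_test, k):
    # kNN regression: predict the mean response of the k nearest neighbours
    return np.array([y_train[k_neighbours(X_train, x, k)].mean()
                     for x in X_test])
```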

In the following cell, we tune our parameter k using 5-fold cross-validation.

Interpretation of plot of training and validation set MSE for fold #4 against number of neighbours

We can see on this plot of training and validation MSE for fold #4 that there clearly is a number of neighbours corresponding to the minimum of the validation MSE. In the next cell we will be computing this number of neighbours for minimal MSE for all the folds.

For fold #4, as plotted above, the optimal value of k is 2, where the MSE is minimal on the validation set. Then, looking at val_k in the fourth row, we can see that the corresponding minimal MSE is 7.42.

Analysis of distribution of errors for best k over folds 3 and 4

Analysis of errors of kNN predicted response var

We can see that, for the different best numbers of neighbours obtained from our 5-fold cross-validation (k=2 and k=3), our errors (response variable y from the validation set minus the predicted response variable) follow a bell-shaped curve with mean approximately zero (0.09 and 0.02), which appears Gaussian with standard deviations of 3.52 and 3.73. This could be explained by the central limit theorem.

1.3.2

In this task, we compute the in-sample and out-of-sample MSE over the 5 folds and compare them to the values from standard linear and ridge regression. We also compare the performance of the kNN algorithm to the two other models using the coefficient of determination, the distribution of errors and a comparison of MSEs.

Comparison of the out-of-sample and the in-sample MSE, and with other models

With kNN, we do not assume linearity of the data. As the previous residual plots have shown, the points at higher values of y suggest that the residuals are not uniformly distributed around 0 over the whole range of y, making us think that a linear model might not be the best fit for this data. The MSE results we get for kNN confirm this hypothesis, as the values are much smaller (about 13 on the validation set for kNN vs 19 for ridge and linear regression). A point to note, however, is that the in-sample and out-of-sample MSEs differ significantly, which could indicate overfitting.

Interpretation of residual plots

Clearly, in the residual plots for the kNN predictions, the residuals are more uniformly distributed around y=0 than those for standard linear regression and ridge regression. This could be because a non-linear model such as kNN fits our data better.

Compute distribution of errors for standard linear regression and ridge regression

This time, we can see that the errors in the response variables predicted by standard linear regression and ridge regression are less smooth than those predicted by kNN. Indeed, their mean is a bit off zero, with a higher standard deviation. We explore why that is in the comparison of model performance below. We can hypothesise that our kNN model might be the best of the three.

Compare performance of models

Computing the R squared, or coefficient of determination, gives the proportion of the variance in the dependent variable that is predictable from the independent variables. We can compare these values because we are not adding any descriptors to the 3 models tested, which ensures that R2 is comparable. Indeed, if we were to add descriptors to one of the models, R2 would naturally increase, and in that case we would need to look at the adjusted R2 instead.

In any case, we can see that kNN has the highest R2 score on its validation set, which is attractive. However, we know that its in-sample and out-of-sample MSEs differ significantly, which could mean overfitting. Also, from our residual plots for the 3 models, the kNN residuals seem more uniformly distributed, which could confirm the hypothesis that some of our descriptors are non-linear (see the pairplot at the beginning of this task), as kNN is a non-linear method.

Regarding the homogeneity of the data, given that our analysis found the residuals not quite uniform, we could perhaps say that our data is not exactly homogeneous.

TASK 2: CLASSIFICATION

2.1 LOGISTIC REGRESSION

In this task, we wish to implement a logistic regression classifier which is used to classify data points binarily using the logistic function: $$ f(x) = \frac{1}{1+e^{-x}} $$

2.1.0 Import data

First of all, let's import the data, define predictors and response variables and standardize.

2.1.1

This exercise is based around solving the following equation: $$ \hat{\boldsymbol y}_{\text{log}} = f(\boldsymbol \beta^T \boldsymbol X + \beta_0) $$ where $\boldsymbol X = [X^{(1)}, X^{(2)}, \dots, X^{(n)}]$, and $X^{(i)} \in \mathbb R^d$ and f being the logistic function.

Now, let us define the major functions that we will be using for training our logistic regression classifier:

1) logistic is a function which computes the logistic function for a given x

2) predict_log is a function which computes the y from the equation above

3) initialise is a function which initialises our parameters beta and beta_0 randomly

Next, we implement a forward pass, also called propagate, which implements the cost function and its derivatives. We use the derivatives as part of a gradient descent to optimise the cost function. Cost function: $$ \mathcal L = - \frac{1}{n} \sum_{i=1}^n \Big[ y^{(i)} \log(\hat{y}_{\text{log}}^{(i)}) + (1-y^{(i)}) \log (1-\hat{y}_{\text{log}}^{(i)}) \Big] \, . $$ Partial derivatives (used for descent on $\mathcal L$): $$ \frac{\partial \mathcal L}{\partial \boldsymbol \beta} = \frac{1}{n} \sum_{i=1}^n ( \hat{y}_{\text{log}}^{(i)} - y^{(i)}) X^{(i)} $$

$$ \frac{\partial \mathcal L}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^n ( \hat{y}_{\text{log}}^{(i)} - y^{(i)}) $$
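A sketch of the forward pass, with the gradients oriented for gradient descent on the cross-entropy cost (the notebook's actual propagate may differ in signature):

```python
import numpy as np

def logistic(x):
    # Logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-x))

def propagate(beta, beta0, X, y):
    # Forward pass: rows of X are samples, y holds binary labels in {0, 1}
    n = X.shape[0]
    y_hat = logistic(X @ beta + beta0)
    # Cross-entropy cost
    cost = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # Gradients: descending these decreases the cost
    dbeta = X.T @ (y_hat - y) / n
    dbeta0 = np.mean(y_hat - y)
    return cost, dbeta, dbeta0
```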

Now, in the next cell, let us optimise the model with a defined learning rate which we will later tune, by calculating the parameters for our training sets.

Now in the following cell, we can predict the labels (1 or 0) for our test set, and hence do the classification.

Now let us define our model function, which takes as inputs all the hyperparameters, including the 2 we wish to tune (learning rate and decision threshold), and returns a dictionary with all the wanted information.

Now let's finally conduct our 5-fold cross-validation to tune the parameters.

In the following cell, let us implement grid search to tune our parameters of learning rate and decision threshold:

2.1.2

In this task, we wish to compare the performance of our optimal model on the training data and on the test data by the mean accuracies. This is easily implemented via our model function which already returns the mean accuracies. Hence, we use this function with our optimized parameters from grid search.

The test mean accuracy is 74% vs 74.5% on the training data, which is a good sign: our model fits the test data relatively well, losing only 0.5 points in accuracy.

2.2 RANDOM FOREST

2.2.0 Import Data

First of all, let's import the data and split it up into predictors and response variable, and in the correct format (pandas Dataframe or Series).

2.2.1

In this task, we are asked to train a random forest classifier on our training data with 5 fold cross validation and tuning of 3 parameters: number of decision trees, depth of trees and maximum number of descriptors.

In order to do so, let's first define the entropy impurity criterion: (add source) $$ \text{CE}(\boldsymbol y) = - \sum_{i=1}^N \mathbb P (y_i)\log_2\big(\mathbb P(y_i)\big) $$

Secondly we are going to implement the following algorithm:

1) Bootstrapping: producing n_trees random samples from the training data set by random sampling with replacement

2) Ensemble of models: for each of our n_trees samples, we get a decision tree

3) Aggregating for classification, which uses the ensemble of models to predict our variable

Let's now define the functions that we will be calling later in our code:

1) cross_entropy computes the cross entropy for a given vector of training labels (1/0 in the case of binarity like here)

2) split_dataset2 which returns the split of data whose column-th feature equals value

3) cross_entropy_purification which returns the entropy impurity criterion which we use for the splits

4) choose_best_feature2 which chooses the best feature to split by calling the function cross_entropy_purification for each feature

5) majority_vote which returns the label that appears most often in our label variable y
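As an illustration, the entropy impurity of a label vector could be computed as follows (a sketch; the coursework's cross_entropy may handle the probabilities differently):

```python
import numpy as np

def cross_entropy(y):
    # Entropy impurity of a label vector: -sum_c p_c * log2(p_c),
    # where p_c is the empirical frequency of class c in y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
```

A pure node (all labels identical) has impurity 0, while a balanced binary split has impurity 1 bit.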

Now that we have all these functions defined, we can finally build our trees and forest as follows:

1) build_tree2 which returns a tree according to the data

2) train which builds the tree according to the training data

3) classify2 which classifies a single sample with the fitted decision tree.

4) classify2_forest which classifies a sample with a forest

5) predict which predicts the classification results for X

6) rand_s which randomly picks a sample from X and y without replacement

7) forest which creates the forest

Now we are ready to implement a 5 fold cross validation over our dataset as follows:

Let's implement a grid search over our 3 parameters, namely the depth of the trees, the maximum number of features chosen at each split (a good starting point is 11/3, per the lecture notes) and the number of decision trees.

2.2.2 Performance of model on training and test data from confusion matrix

In this exercise, we are asked to compare the performance of our optimal model on test and train data, which we will do so by the use of a confusion matrix and the coefficient of determination.

Interpretation of accuracy

We get a relatively okay out-of-sample accuracy of 68%, close to the in-sample accuracy. Note, however, that this algorithm predicts only ones, which is highly questionable, as we have a fair amount of zeros in both our train and test response variables. The error is probably in my implementation of the random forest prediction step, but unfortunately I haven't been able to spot it in time.

Compare performance of optimal model on training and test data

First of all, the out-of-sample accuracy is 68.5% and the in-sample accuracy is 70.4%, which suggests our model performs quite well on the test data. However, when we look at the confusion matrix, we see that our model has predicted all ones. This isn't a good sign if you are the bank for which we are analysing who can or can't be granted a credit, as it would mean everyone is granted one...

However, using the test-set confusion matrix, we can read off the TN (true negatives: entry (0,0)), the TP (true positives: entry (1,1)), the FN (false negatives: entry (1,0)) and the FP (false positives: entry (0,1)). We get a false negative rate of 31.5%, which isn't ideal, considering that it means granting credit to applicants who don't meet the criteria and who potentially won't be able to pay back the loans, eventually making the bank lose money.

Precision score: TP / (TP+FP)

Recall score: TP / (TP+FN)

Accuracy: (TP+TN) / Total
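These three metrics can be computed directly from the confusion-matrix counts, for example:

```python
def precision_recall_accuracy(TP, FP, FN, TN):
    # Metrics derived from the four confusion-matrix counts
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    accuracy = (TP + TN) / (TP + FP + FN + TN)
    return precision, recall, accuracy
```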

2.3 SVM

2.3.0 Import data

First of all, I will be importing the data and setting a few parameters:

1) all numpy arrays,

2) the response variable shall have its zeros replaced by -1, as we are in the hard-margin case with a hyperplane separating the ones from the minus ones.

2.3.1

(i) First of all we will be implementing a hard margin standard linear SVM on the training data, using the following information:

So let's define the functions that we need for this first implementation of a linear hard-margin SVM:

1) compute_cost which computes the hinge loss defined above

2) calculate_cost_gradient which computes the gradient of the cost function

3) sgd which is our model with all the parameters to compute the weights
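A sketch of these three functions under the stated setup: labels in {-1, +1}, the intercept folded into X as a column of ones, and the regularisation weighting in compute_cost is an assumption.

```python
import numpy as np

def compute_cost(w, X, y, lam=1.0):
    # Hinge loss with L2 term: 1/2 ||w||^2 + lam/n * sum_i max(0, 1 - y_i (w . x_i))
    margins = np.maximum(0, 1 - y * (X @ w))
    return 0.5 * np.dot(w, w) + lam * np.mean(margins)

def calculate_cost_gradient(w, x_i, y_i, lam=1.0):
    # Subgradient of the cost for a single sample
    if y_i * np.dot(w, x_i) >= 1:
        return w              # margin satisfied: only the regulariser contributes
    return w - lam * y_i * x_i  # margin violated: hinge term contributes too

def sgd(X, y, lr=0.01, epochs=100, lam=1.0):
    # Stochastic (sub)gradient descent over the hinge loss
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            w -= lr * calculate_cost_gradient(w, X[i], y[i], lam)
    return w
```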

Let's now define a score function which returns the accuracy of our model:

Interpretation of accuracy for hard margin linear SVM

We get an accuracy of 68% on the test set compared to 61% on the train set. We could wonder why that is, because usually the training set has higher accuracy than the test set. Maybe the model happens to do particularly well on this test set, and the way it was created was perhaps not random.

(ii) Now let's implement a hard-margin kernel SVM with radial basis functions

We are now facing a slightly updated cost-minimising problem where we wish to optimize gamma and b subject to an RBF kernel function:

$$ \mathcal L (\boldsymbol w_z ; b) = \frac{1}{2} \| \boldsymbol w_z \|^2 + \frac{\lambda}{n} \sum_{i=1}^n \max \bigg( 0, 1-y_i (\boldsymbol w_z \cdot z_i + b) \bigg) \, . $$

This time we are not including b in the predictors as an intercept; rather, we wish to optimise it. From the lecture notes we have the following, with k being the kernel: $$ \boldsymbol w_z \cdot z_i = \phi (\boldsymbol w_z) \cdot \phi (\boldsymbol x_i) = k (\boldsymbol w, \boldsymbol x_i) = \exp \bigg(- \frac{\lVert \boldsymbol w_z - \boldsymbol x_i \rVert^2 }{\sigma^2} \bigg) $$ To predict the class of some new x we have: $$ k (\boldsymbol w, \boldsymbol x_{\text{new}}) + b \geq 0 \implies \hat{y} = +1 $$ $$ k (\boldsymbol w, \boldsymbol x_{\text{new}}) + b < 0 \implies \hat{y} = -1 $$

First of all, let's reload the data and redefine our training and test predictors and response variables.

Now let's define the necessary functions for this task:

1) rbf_k which is the rbf kernel function

2) compute_cost_rbf which computes the cost function for our optimization problem
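The RBF kernel itself is a one-liner, using gamma = 1/sigma^2 (a sketch; the notebook's rbf_k may be parameterised by sigma instead):

```python
import numpy as np

def rbf_k(w, x, gamma=1.0):
    # RBF kernel: exp(-gamma * ||w - x||^2), with gamma = 1 / sigma^2
    diff = np.asarray(w) - np.asarray(x)
    return np.exp(-gamma * np.sum(diff ** 2))
```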

Now that we have computed the cost function, we can take the derivative with respect to the weights to compute the gradient: $$ \frac{d}{d\boldsymbol w_z} \mathcal L (\boldsymbol w_z ; b) = \frac{d}{d\boldsymbol w_z} \bigg(\frac{1}{2} \| \boldsymbol w_z \|^2 + \frac{\lambda}{n} \sum_{i=1}^n \max \Big( 0, 1-y_i (\boldsymbol w_z \cdot z_i + b) \Big) \bigg) $$

First term yields the following:

$$ \frac{d}{d\boldsymbol w_z} \frac{1}{2} \| \boldsymbol w_z \|^2 = \boldsymbol w_z $$

The second term inside the summation (when the margin is violated) yields the following:

$$ \frac{d}{d\boldsymbol w_z} \Big(1-y_i (\boldsymbol w_z \cdot z_i + b)\Big) = -y_i \frac{d}{d\boldsymbol w_z} \exp \bigg(- \frac{\lVert \boldsymbol w_z - \boldsymbol x_i \rVert^2 }{\sigma^2} \bigg) = \frac{2}{\sigma^2} y_i (\boldsymbol w_z - \boldsymbol x_i) \exp \bigg(- \frac{\lVert \boldsymbol w_z - \boldsymbol x_i \rVert^2 }{\sigma^2} \bigg) $$

and note that: $$ \gamma = \frac{1}{\sigma^2} $$

Now that we have checked that this model is working, let's implement 5-fold cross-validation and a grid search over the parameters gamma and b in order to tune our model and maximise its accuracy.

Explore parameters accuracy

First of all, let's explore how the accuracy varies with the parameters to see which ranges to pick.

Now, with our optimized parameters, let us compute the y_preds for our model

Compute the F1 score

The F1 score is a function of precision and recall as defined in the random forest part:

$$ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

with Precision = True Positives / (True Positives + False Positives) and Recall = True Positives / (True Positives + False Negatives) (5)
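A direct translation of this formula:

```python
def f1_score(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```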

Interpretation of F1 score

Our F1 score is relatively high, which is generally a good sign; however, our confusion matrix displays only 1's, hence our prediction vector for the response variable always classifies as 1. This is not good for the bank, as they want to select the people who best match the criteria for granting credit. The error is probably in my code, as I haven't implemented a dual SVM.

2.3.2

In this task we are asked to evaluate the performance of our hard-margin RBF kernel SVM on the test data using a receiver operating characteristic (ROC) curve. However, as we are predicting all ones, we expect our true positive rate and false positive rate to be equal no matter what.

However, this is not what should have happened: it is very uncommon to be predicting all ones, so there must be an error somewhere in my code. Usually SVM with a kernel should be our best classification method, but here its implementation was faulty and hence no meaningful results are interpretable.

Overall, the conclusion is that hard-margin RBF kernel SVM and random forests should perform quite well; however, as my predicted response variables contained only ones, it is hard to draw any conclusions from these results.

References